Stochastic Gradient Descent (SGD)#
Stochastic Gradient Descent is an optimization algorithm that differs from ordinary Gradient Descent in that, at each step, the gradient of the optimized function is taken not as the sum of the gradients over all elements of the sample, but as the gradient of one randomly selected element.
Gradient Descent (GD) review#
Before going into the study of stochastic gradient descent, let’s recall the formula of ordinary gradient descent. We have the following:
\(\boldsymbol w = {(w_1, w_2, ..., w_n)}^{T}\) — vector of parameters, where \(n\) is the dimension of \(\boldsymbol w\)
\(\nabla \mathcal L(\boldsymbol w)\) — gradient of the function (its meaning is shown in (1))
\(Q(\boldsymbol w)\) — function which we want to minimize with respect to \(\boldsymbol w\), i.e.
\(\ell\) — number of functions \(\mathcal L_i\) in the sum
Choose \(\boldsymbol w^{(0)}\) — the initial approximation of the parameter vector. Then, each subsequent vector of parameters is calculated as
where \(\eta\) — gradient step or learning rate (responsible for how much to change the vector of parameters in the direction of the gradient)
The stopping of the algorithm will be determined by the convergence of \(Q\) or \(\boldsymbol w\). The graph below represents the Gradient Descent algorithm (GD) for the function \(f(x) = x^2 - 5x + 5\) with \(\eta = 0.85\).
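The update rule above can be sketched in a few lines of Python, for the same example function \(f(x) = x^2 - 5x + 5\) with \(\eta = 0.85\) (the starting point `x0` and the number of steps are arbitrary choices):

```python
# Plain gradient descent on f(x) = x^2 - 5x + 5 (the example from the text).
# The minimum of f is at x = 2.5, where f'(x) = 2x - 5 vanishes.
def gd(x0, eta=0.85, n_steps=50):
    grad = lambda x: 2 * x - 5   # f'(x)
    x = x0
    for _ in range(n_steps):
        x = x - eta * grad(x)    # x := x - eta * f'(x)
    return x

print(round(gd(10.0), 4))        # approaches the minimum at x = 2.5
```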
Stochastic Gradient Descent (SGD)#
GD algorithm problem#
Considering the previous algorithm (5), we come to a problem. To determine a new approximation of the vector \(\boldsymbol w\), it is necessary to calculate the gradient of each element, i.e. \( \sum\limits_{i=1}^\ell \nabla \mathcal L_i(\boldsymbol w)\). In Machine Learning (ML) applications, where the number \(\ell\) of input data points can be huge, calculating all the gradients takes a very long time and can significantly slow down the algorithm, since it is very resource-intensive. During one iteration, the GD algorithm sums the gradients of \(\ell\) functions with the \(n\)-sized vector \(\boldsymbol w\). So, the complexity of 1 iteration in terms of Big O notation is O(\(\ell n\)).
SGD derivation#
In order to improve the efficiency of the algorithm, at each step, instead of calculating the whole sum of gradients (\(\sum\limits_{i=1}^{\ell} \nabla \mathcal L_i(\boldsymbol w)\)), only one of them (\(\nabla \mathcal L_k(\boldsymbol w)\)) can be considered, i.e.
According to formula (6), during 1 iteration we calculate the gradient of only 1 selected function \(\nabla \mathcal L_k(\boldsymbol w)\) with the \(n\)-sized vector \(\boldsymbol w\). It means that the complexity is O(\(n\)).
where \( 1\leqslant k \leqslant \ell\) (usually chosen at random)
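One SGD step per formula (6) can be sketched as follows (an assumed interface: `grads` is a list of per-sample gradient functions, one per \(\mathcal L_i\)):

```python
import random

# One stochastic gradient step: pick a random index k and move along
# the gradient of the single selected function L_k only, which is O(n).
def sgd_step(w, grads, eta):
    k = random.randrange(len(grads))                 # random k, 1 <= k <= ell
    g = grads[k](w)                                  # gradient of L_k at w
    return [wj - eta * gj for wj, gj in zip(w, g)]   # w := w - eta * grad L_k
```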
Did you know that…
The abbreviation “SGD” does not stand only for Stochastic Gradient Descent. It is also short for the Singapore Dollar, which looks like this

By the way, the exchange rate of 1 Singapore dollar (SGD) at the time of writing this topic is 344.38 tenge (tg)
SGD Realization Pseudocode#
Algorithm (Stochastic Gradient descent)
To solve the optimization problem
do the following steps: choose the learning rate \(\eta \) (0 < \(\eta \) < 1) and the rate of forgetting \(\lambda\) (0 < \(\lambda\) < 1)
Initialize \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))
Initial calculation of the function: \(Q (\boldsymbol w) := \frac 1\ell \sum\limits_{i=1}^\ell \mathcal L_i(\boldsymbol w)\)
While \(Q\) doesn’t converge or \(\boldsymbol w\) doesn’t converge DO:
Choose random index value \(k\) where, \(1 \leqslant k \leqslant \ell\)
Recalculate vector: \(\boldsymbol w := \boldsymbol w - \eta \nabla \mathcal L_k(\boldsymbol w)\)
Recalculate function: \(Q := \lambda \mathcal L_k(\boldsymbol w) + (1-\lambda) Q\)
It is necessary to understand that the phrase "While \(Q\) doesn't converge or \(\boldsymbol w\) doesn't converge DO" in the SGD algorithm means that we continue to recalculate \(Q\) and \(\boldsymbol w\) until both of them have approximately the same values at the current step as at the previous one (i.e. are almost unchanged).
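The pseudocode above can be turned into a runnable sketch (assumptions: `losses` and `grads` are lists of per-sample functions \(\mathcal L_i\) and \(\nabla \mathcal L_i\); for brevity, convergence is checked through the change of \(Q\) only):

```python
import random

# A possible realization of the SGD pseudocode. Q is tracked as an
# exponential moving average with forgetting rate lam (lambda).
def sgd(losses, grads, w, eta=0.1, lam=0.8, tol=1e-8, max_iter=100_000):
    ell = len(losses)
    Q = sum(L(w) for L in losses) / ell                  # initial Q(w)
    for _ in range(max_iter):
        k = random.randrange(ell)                        # random index k
        g = grads[k](w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]      # w := w - eta * grad L_k
        Q_new = lam * losses[k](w) + (1 - lam) * Q       # Q := lam*L_k + (1-lam)*Q
        if abs(Q_new - Q) < tol:                         # Q almost unchanged -> stop
            return w, Q_new
        Q = Q_new
    return w, Q
```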
Derivation of the formula \(Q = \lambda \mathcal L_k(\boldsymbol w) + (1-\lambda)Q\)…
Here, \(Q = Q_{\ell}\)
We can first represent \(Q\) as

$$ Q_{\ell} = \frac 1{\ell} (\mathcal L_{\ell}(\boldsymbol w) + \mathcal L_{\ell - 1}(\boldsymbol w) + \mathcal L_{\ell - 2}(\boldsymbol w) + ... + \mathcal L_1(\boldsymbol w)) $$
Multiplying the whole formula by \(\ell\) and moving \(\mathcal L_{\ell}(\boldsymbol w)\) to the left we obtain:
Repeating the first (1) step for the \(Q_{\ell - 1}\):
It can be seen that we can substitute \(\ell Q_{\ell} - \mathcal L_{\ell}(\boldsymbol w)\) instead of \((\mathcal L_{\ell - 1}(\boldsymbol w) + \mathcal L_{\ell - 2}(\boldsymbol w) + ... + \mathcal L_1(\boldsymbol w))\) and get:
Opening the brackets and expressing \(Q_{\ell}\) from there finally:
where, \(\lambda = \frac 1{\ell}\)
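A quick numerical check of this derivation: with \(\lambda = \frac 1{\ell}\) the recursion reproduces the exact mean of the \(\mathcal L_i\) values (the numbers below are arbitrary):

```python
# Verify Q_ell = lambda * L_ell + (1 - lambda) * Q_{ell-1} with lambda = 1/ell.
losses = [3.0, 1.0, 4.0, 1.5, 5.0]    # arbitrary values of L_1 .. L_5

Q = losses[0]                          # Q_1 = L_1
for ell in range(2, len(losses) + 1):
    lam = 1.0 / ell
    Q = lam * losses[ell - 1] + (1 - lam) * Q

# Q now equals sum(losses) / len(losses) = 2.9 up to float rounding.
```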
The graph below represents how the SGD algorithm works with different \(\eta\) (learning rate) for the function \(f(x) = x^2\). It tries to converge to the optimum (\(x = 0\)) when 0 < \(\eta < 1\) and diverges as \(\eta > 1\). In case of \(\eta = 1\) it is looped and if \(\eta = 0\) then it doesn’t move.
SGD Exercise:#
Given:
\(\boldsymbol w = (w_1, w_2, w_3) ^ T = (1, -1, 0.5) ^ T\)
\(\eta = 0.1\)
\(\lambda = 0.8\)
\(\mathcal L(\boldsymbol w) = w_1^{2} + w_2^{2} + 0.5 w_3^{2}\)
What is the sum of all parameters of the \(\boldsymbol w\) in the next iteration?
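To check your answer numerically, one SGD step can be computed directly (using \(\nabla \mathcal L = (2w_1, 2w_2, w_3)^T\), which follows from the given \(\mathcal L(\boldsymbol w)\)):

```python
# One SGD step for the exercise data: w := w - eta * grad L(w).
w = [1.0, -1.0, 0.5]
eta = 0.1
grad = [2 * w[0], 2 * w[1], w[2]]                     # gradient of L at w
w_next = [wj - eta * gj for wj, gj in zip(w, grad)]
print(w_next, sum(w_next))                            # prints [0.8, -0.8, 0.45] 0.45
```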
display_quiz("#q_w1")
Mini Batch GD#
Mini batch gradient descent is actually a mixture of Batch Gradient Descent and SGD. It computes the gradient of the function over small random subsets of the elements (mini batches).
The uniqueness of the Mini Batch GD compared to the SGD is that instead of using only 1 random index, the subset of indexes is used. Then we do the following:
A subset of indexes is considered, i.e. \(i_1 < i_2 < ... < i_B\), where \(1 < B < \ell\)
In Mini Batch GD, the function \(E(\boldsymbol w)\) is calculated over the subset of indexes i.e. \(E_{B}(\boldsymbol w) = \sum\limits_{j=1}^B \mathcal L_{i_j}(\boldsymbol w)\)
The vector \(\boldsymbol w\) is calculated over the subset of indexes i.e. \(\boldsymbol w = \boldsymbol w - \eta \sum\limits_{j=1}^B \nabla \mathcal L_{i_j}(\boldsymbol w)\)
Considering the complexity of the Mini Batch GD algorithm in terms of Big O notation: during 1 iteration we sum \(B\) gradients of functions with the \(n\)-sized vector \(\boldsymbol w\). So, the complexity is O(\(B n\)).
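One Mini Batch GD step can be sketched as follows (same assumed interface as before: `grads` is a list of per-sample gradient functions; the batch indices are drawn without replacement):

```python
import random

# One mini-batch step: sample B indices and sum their gradients,
# as in the update formula above. Cost per iteration is O(B * n).
def minibatch_step(w, grads, eta, B):
    batch = random.sample(range(len(grads)), B)        # random subset i_1 .. i_B
    step = [0.0] * len(w)
    for i in batch:
        g = grads[i](w)                                # grad of L_{i_j} at w
        step = [s + gj for s, gj in zip(step, g)]      # sum of B gradients
    return [wj - eta * sj for wj, sj in zip(w, step)]  # w := w - eta * sum
```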
Mini Batch GD VS SGD#
Compared to the SGD, Mini Batch GD has less oscillations. It also requires more memory and computational resources than SGD. The graph below represents the value of \(Q (\boldsymbol w)\) at each iteration step.
As can be seen, Mini Batch GD converges faster than SGD, because averaging \(\mathcal L (\boldsymbol w)\) over a batch gives a more precise update of the parameters.
Heuristics#
"Heuristics" here means a set of techniques and methods that facilitate and simplify the solution of the corresponding tasks. Here is a list of suggested heuristics for working with SGD:
Initialization of the vector \(\boldsymbol w\)
Choice for learning rate \(\eta\)
Initialization of the vector \(\boldsymbol w\)#
One of the crucial steps of the SGD algorithm is the initialization of \(\boldsymbol w\). It is common to use the zero vector as the initial value, but there are other approaches as well.
So, \(\boldsymbol w = \boldsymbol 0\)
Another way is to take small random values of \(w_j\), since large initial values can lead into an area with small (in absolute value) derivatives, from which it is difficult to get out.
So, \(w_j\) = random(\(-\frac 1{2n}\), \(\frac 1{2n}\)) (the denominator contains "\(2n\)" because in case of \(n = 1\) the bound would otherwise be 1, which is an inappropriate value; with "\(2n\)" it is at most 0.5.)
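This initialization heuristic is a one-liner (a sketch; `n` is the dimension of \(\boldsymbol w\)):

```python
import random

# Small random initialization: each w_j is drawn uniformly
# from the interval (-1/(2n), 1/(2n)), as suggested above.
def init_weights(n):
    return [random.uniform(-1 / (2 * n), 1 / (2 * n)) for _ in range(n)]

w = init_weights(4)   # here every |w_j| <= 1/8
```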
Dynamic Learning Rate \(\eta\)#
As an optimization, the learning rate \(\eta\) can be changed dynamically at each iteration step: replace \(\eta\) with a time-dependent learning rate \(\eta(t)\). Here is a list of some commonly used formulas for \(\eta(t)\):
\(\; \eta(t)=\frac 1{t}\), inverse (hyperbolic) decay
\(\; \eta(t)=\eta_o \cdot e^{-\lambda t}\), exponential decay
\(\; \eta(t)=\eta_o \cdot (\beta t + 1)^{-\alpha}\), polynomial decay
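The three schedules above, written as plain functions (the values of `eta0`, `lam`, `beta` and `alpha` are arbitrary examples, not recommendations):

```python
import math

# Time-dependent learning rate schedules eta(t) from the list above.
def inverse_decay(t):
    return 1.0 / t                               # eta(t) = 1/t

def exponential_decay(t, eta0=0.5, lam=0.1):
    return eta0 * math.exp(-lam * t)             # eta(t) = eta0 * e^(-lam*t)

def polynomial_decay(t, eta0=0.5, beta=1.0, alpha=0.5):
    return eta0 * (beta * t + 1) ** (-alpha)     # eta(t) = eta0 * (beta*t+1)^(-alpha)
```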
Note
If \(\eta(t)\) decreases too quickly, we stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization.
Constant learning rate VS Dynamic learning rate#
First, look at the SGD with \(\eta = 0.1\)
Advantages and Disadvantages#
As any algorithm, the SGD has its own pros and cons.
Advantages:
Simple to implement
Suitable for problems with a huge amount of data (input points)
Requires less memory and computational resources than GD
Disadvantages:
It can easily get stuck in local optima
It has strong fluctuations
Optimizers of Gradient Algorithms#
The main disadvantage of gradient algorithms is getting stuck in local optima (minimum).
Moreover, at such points the derivative is zero, which means that further change of the parameter vector \(\boldsymbol w\) according to formula (6) ( \(\nabla Q(\boldsymbol w) = \nabla \mathcal L_k(\boldsymbol w) = 0\) ) will not happen (we will get stuck).
In addition, stochastic gradient algorithms can form strong fluctuations when moving to the minimum point of the function.
Several solutions have been proposed for the problems above. See below.
Momentum Gradient Descent#
It was proposed to average the gradient in steps using the formula:
where, \(\boldsymbol v\) — accumulates the average of the gradients
\(\;\;\;\;\;\;\;\;\;\gamma\) — parameter (using it, you can adjust how many past gradients are taken into account in the value of \(\boldsymbol v\))
Note
if we rewrite \((1-\gamma)\eta_t\) from the formula {eq}`momentum-formula` as \(\eta_t := (1-\gamma)\cdot\eta_t\), we obtain the algorithm: \(\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\)
Finally, just change the \(\boldsymbol w\) using the familiar principle:
where, \(\boldsymbol v\) - smoothed gradients
How does \(\gamma\) influence the number of last gradients averaged?
The expression \(N \approx \frac 2 {(1-\gamma)} + 1\) shows approximately how many of the last gradients are taken into account in the calculation of \(\boldsymbol v\).
It is an approximate formula because in the exponential moving average all gradients ever encountered are averaged.
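The momentum update (in the simplified form from the note, \(\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\) followed by \(\boldsymbol w = \boldsymbol w - \boldsymbol v\)) can be sketched as:

```python
# One momentum step: v accumulates a smoothed gradient, then w shifts by v.
# grad_fn is an assumed interface: it returns the gradient of Q at w.
def momentum_step(w, v, grad_fn, eta=0.1, gamma=0.9):
    g = grad_fn(w)
    v = [gamma * vj + eta * gj for vj, gj in zip(v, g)]  # v := gamma*v + eta*grad
    w = [wj - vj for wj, vj in zip(w, v)]                # w := w - v
    return w, v
```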
NAG - Nesterov’s accelerated gradient#
Taking into account the already mentioned formula in momentum:
where, \(\eta_t = (1-\gamma)\cdot\eta_t\) for simplicity
Now, we can represent the “\(\boldsymbol v\)” as the sum of the vectors “\(\gamma \boldsymbol v\)” and “\(\eta_t \cdot \nabla Q(\boldsymbol w)\)”.
But at the time of calculating the gradient, we already know that we will shift by the value of “\(\gamma \boldsymbol v\)”. Therefore, it would be more correct to calculate the gradient not at point \(\boldsymbol w\), but at point (\(\boldsymbol w - \gamma \boldsymbol v\)) or:
A detailed step-by-step visualization of the NAG algorithm applied to the quadratic function (\(f(x) = x^2\)) is shown below. The look-ahead step is also shown.
Mathematically NAG can be expressed as:
Such an approach, as a rule, shows better convergence to the optimum point.
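A sketch of one NAG step: the only difference from momentum is that the gradient is evaluated at the look-ahead point \(\boldsymbol w - \gamma \boldsymbol v\), not at \(\boldsymbol w\) itself:

```python
# One Nesterov accelerated gradient step (same assumed grad_fn interface
# as in the momentum sketch: it returns the gradient of Q at a point).
def nag_step(w, v, grad_fn, eta=0.1, gamma=0.9):
    lookahead = [wj - gamma * vj for wj, vj in zip(w, v)]  # w - gamma*v
    g = grad_fn(lookahead)                                 # gradient at look-ahead
    v = [gamma * vj + eta * gj for vj, gj in zip(v, g)]    # v := gamma*v + eta*grad
    w = [wj - vj for wj, vj in zip(w, v)]                  # w := w - v
    return w, v
```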